balance Quickstart: Analyzing and adjusting the bias on a simulated toy dataset¶

'balance' is a Python package that is maintained and released by the Core Data Science Tel-Aviv team in Meta. 'balance' performs and evaluates bias reduction by weighting for a broad set of experimental and observational use cases.

Although balance is written in Python, you don't need a deep Python understanding to use it. In fact, you can just use this notebook, load your data, change some variables and re-run the notebook and produce your own weights!

This quickstart demonstrates re-weighting specific simulated data, but if you have a different usecase or want more comprehensive documentation, you can check out the comprehensive balance tutorial.

Analysis¶

There are four main steps to analysis with balance:

  • load data
  • check diagnostics before adjustment
  • perform adjustment + check diagnostics
  • output results

Let's dive right in!

Example dataset¶

The following is a toy simulated dataset.

In [1]:
import pandas as pd
import numpy as np 

np.random.seed(2022-11-8) # for reproducibility
n_target = 10000
target_df = pd.DataFrame(
        {
            "gender": np.random.choice(["Male", "Female"], size = n_target, replace = True, p= [.5,.5]),
            "age_group": np.random.choice(["18-24","25-34", "35-44", "45+"], size = n_target, replace = True, p= [.20, .30, .30,.20]),
            "income": np.random.normal(3, 2, size = n_target)**2,
            # "unrelated_variable": np.random.uniform(size = n_target),
            "id": (np.array(range(n_target)) + 100000).astype(str),
            # "weight": np.random.uniform(size = n_target) + 0.5,
        }
    )
target_df.happiness = np.random.normal(50, 10, size = n_target) + \
                      np.where(target_df.gender == "Female", 1, 0) * np.random.normal(5, 2, size = n_target)
# We also have missing values in gender
target_df.loc[3:900, "gender"] = np.nan

n_sample = 1000
sample_df = pd.DataFrame(
        {
            "gender": np.random.choice(["Male", "Female"], size = n_sample, replace = True, p= [.7,.3]),
            "age_group": np.random.choice(["18-24","25-34", "35-44", "45+"], size = n_sample, replace = True, p= [.50, .30, .15,.05]),
            "income": np.random.normal(2, 1.5, size = n_sample)**2,
            # "unrelated_variable": np.random.uniform(size = n_sample),
            "id": (np.array(range(n_sample))).astype(str),
            # "weight": np.random.uniform(size = n_sample) + 0.5,
        }
    )
# females are happier
# older people are happier
# people with higher income are happeir
sample_df["happiness"] = np.random.normal(40, 10, size = n_sample) + \
                      np.where(sample_df.gender == "Female", 1, 0) * np.random.normal(20, 1, size = n_sample) + \
                      np.where(sample_df.age_group == "35-44", 1, 0) * np.random.normal(5, 1, size = n_sample) + \
                      np.where(sample_df.age_group == "45+", 1, 0) * np.random.normal(20, 1, size = n_sample) + \
                      np.random.normal((np.random.normal(3, 2, size = n_sample)**2) / 20, 1, size = n_sample) 
# Truncate for max to be 100
sample_df["happiness"] = np.where(sample_df["happiness"] < 100, sample_df["happiness"], 100)

# We also have missing values in gender
sample_df.loc[3:90, "gender"] = np.nan
/tmp/ipykernel_269/5624192.py:16: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
  target_df.happiness = np.random.normal(50, 10, size = n_target) + \

In practice, one can use pandas loading function(such as read_csv()) to import data into the DataFrame objects sample_df and target_df.

Load data¶

The first thing to do is to import the Sample class from balance. All of the data we're going to be working with, sample or population, will be stored in objects of the Sample class.

In [2]:
from balance import Sample
INFO (2022-11-12 23:06:11,727) [__init__/<module> (line 52)]: Using balance version 0.1.0

Using the Sample class, we can fill it with a "sample" we want to adjust, and also a "target" we want to adjust towards.

We turn the two input pandas DataFrame objects we created (or loaded) into a balance.Sample objects, by using the .from_frame()

In [3]:
sample = Sample.from_frame(sample_df, outcome_columns=["happiness"])
target = Sample.from_frame(target_df)
WARNING (2022-11-12 23:06:12,625) [util/guess_id_column (line 100)]: Guessed id column name id for the data
WARNING (2022-11-12 23:06:12,637) [sample_class/from_frame (line 235)]: No weights passed, setting all weights to 1
WARNING (2022-11-12 23:06:12,662) [util/guess_id_column (line 100)]: Guessed id column name id for the data
WARNING (2022-11-12 23:06:12,707) [sample_class/from_frame (line 235)]: No weights passed, setting all weights to 1

If we use the .df property call, we can see the DataFrame stored in sample. We can see how we have a new weight column that was added (it will all have 1s) in the importing of the DataFrames into a balance.Sample object.

In [4]:
sample.df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   id         1000 non-null   object 
 1   gender     912 non-null    object 
 2   age_group  1000 non-null   object 
 3   income     1000 non-null   float64
 4   happiness  1000 non-null   float64
 5   weight     1000 non-null   int64  
dtypes: float64(2), int64(1), object(3)
memory usage: 47.0+ KB

We can get a quick overview text of each Sample object, but just calling it.

Let's take a look at what this produces:

In [5]:
sample
Out[5]:
(balance.sample_class.Sample)

        balance Sample object
        1000 observations x 3 variables: gender,age_group,income
        id_column: id, weight_column: weight,
        outcome_columns: happiness
        
In [6]:
target
Out[6]:
(balance.sample_class.Sample)

        balance Sample object
        10000 observations x 3 variables: gender,age_group,income
        id_column: id, weight_column: weight,
        outcome_columns: None
        

Next, we combine the sample object with the target object. This is what will allow us to adjust the sample to the target.

In [7]:
sample_with_target = sample.set_target(target)

Looking on sample_with_target now, it has the target atteched:

In [8]:
sample_with_target
Out[8]:
(balance.sample_class.Sample)

        balance Sample object with target set
        1000 observations x 3 variables: gender,age_group,income
        id_column: id, weight_column: weight,
        outcome_columns: happiness
        
            target:
                 
	        balance Sample object
	        10000 observations x 3 variables: gender,age_group,income
	        id_column: id, weight_column: weight,
	        outcome_columns: None
	        
            3 common variables: income,gender,age_group
            

Pre-Adjustment Diagnostics¶

We can use .covars() and then followup with .mean() and .plot() (barplots and qqplots) to get some basic diagnostics on what we got.

We can see how:

  • The proportion of missing values in gender is similar in sample and target.
  • We have younger people in the sample as compared to the target.
  • We have more females than males in the sample, as compared to around 50-50 split for the (non NA) target.
  • Income is more right skewed in the target as compared to the sample.
In [9]:
sample_with_target.covars().mean()
Out[9]:
_is_na_gender[T.True] age_group[T.25-34] age_group[T.35-44] age_group[T.45+] gender[Female] gender[Male] gender[_NA] income
source
self 0.0880 0.3090 0.1720 0.0460 0.2680 0.6440 0.0880 5.991020
target 0.0898 0.2974 0.2992 0.2063 0.4551 0.4551 0.0898 12.737608
In [10]:
sample_with_target.covars().plot()
/home/scubasteve/projects/venv/lib/python3.9/site-packages/balance/stats_and_plots/weighted_stats.py:473: FutureWarning:

The default value of numeric_only in DataFrameGroupBy.sum is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.

/home/scubasteve/projects/venv/lib/python3.9/site-packages/balance/stats_and_plots/weighted_stats.py:473: FutureWarning:

The default value of numeric_only in DataFrameGroupBy.sum is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.

/home/scubasteve/projects/venv/lib/python3.9/site-packages/balance/stats_and_plots/weighted_stats.py:473: FutureWarning:

The default value of numeric_only in DataFrameGroupBy.sum is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.

/home/scubasteve/projects/venv/lib/python3.9/site-packages/balance/stats_and_plots/weighted_stats.py:473: FutureWarning:

The default value of numeric_only in DataFrameGroupBy.sum is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.

Adjusting Sample to Population¶

Next, we adjust the sample to the target. The default method to be used is 'ipw' (which uses inverse probability/propensity weights, after running logistic regression with lasso regularization).

In [11]:
# Using ipw to fit survey weights
adjust = sample_with_target.adjust(max_de = None)
INFO (2022-11-12 23:06:21,548) [ipw/ipw (line 380)]: Starting ipw function
INFO (2022-11-12 23:06:21,577) [adjustment/apply_transformations (line 221)]: Adding the variables: []
INFO (2022-11-12 23:06:21,591) [adjustment/apply_transformations (line 222)]: Transforming the variables: ['income', 'gender', 'age_group']
/home/scubasteve/projects/venv/lib/python3.9/site-packages/balance/util.py:1219: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

/home/scubasteve/projects/venv/lib/python3.9/site-packages/balance/util.py:1219: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

INFO (2022-11-12 23:06:21,617) [adjustment/apply_transformations (line 252)]: Final variables in output: ['income', 'gender', 'age_group']
INFO (2022-11-12 23:06:21,637) [ipw/ipw (line 414)]: Building model matrix
INFO (2022-11-12 23:06:21,883) [ipw/ipw (line 436)]: The formula used to build the model matrix: ['income + gender + age_group + _is_na_gender']
INFO (2022-11-12 23:06:21,884) [ipw/ipw (line 439)]: The number of columns in the model matrix: 16
INFO (2022-11-12 23:06:21,885) [ipw/ipw (line 440)]: The number of rows in the model matrix: 11000
INFO (2022-11-12 23:06:21,914) [ipw/ipw (line 471)]: Fitting logistic model
INFO (2022-11-12 23:06:24,966) [ipw/ipw (line 543)]: Chosen lambda for cv: [0.01386604]
INFO (2022-11-12 23:06:24,968) [ipw/ipw (line 551)]: Proportion null deviance explained [0.17374302]

Evaluation of the Results¶

We can get a basic summary of the results:

In [12]:
print(adjust.summary())
Covar ASMD reduction: 62.3%, design effect: 2.249
Covar ASMD (7 variables):0.335 -> 0.126
Model performance: Model proportion deviance explained: 0.174

We see an improvement in the average ASMD. We can look at detailed list of ASMD values per variables using the following call.

In [13]:
adjust.covars().asmd()
Out[13]:
age_group[T.25-34] age_group[T.35-44] age_group[T.45+] gender[Female] gender[Male] gender[_NA] income mean(asmd)
source
self 0.040094 0.019792 0.137361 0.089228 0.061820 0.047739 0.246918 0.126310
unadjusted 0.025375 0.277771 0.396127 0.375699 0.379314 0.006296 0.517721 0.334860
unadjusted - self -0.014719 0.257980 0.258765 0.286472 0.317494 -0.041444 0.270802 0.208551

It's easier to learn about the biases by just running .covars().plot() on our adjusted object.

In [14]:
adjust.covars().plot()
/home/scubasteve/projects/venv/lib/python3.9/site-packages/balance/stats_and_plots/weighted_stats.py:473: FutureWarning:

The default value of numeric_only in DataFrameGroupBy.sum is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.

/home/scubasteve/projects/venv/lib/python3.9/site-packages/balance/stats_and_plots/weighted_stats.py:473: FutureWarning:

The default value of numeric_only in DataFrameGroupBy.sum is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.

/home/scubasteve/projects/venv/lib/python3.9/site-packages/balance/stats_and_plots/weighted_stats.py:473: FutureWarning:

The default value of numeric_only in DataFrameGroupBy.sum is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.

/home/scubasteve/projects/venv/lib/python3.9/site-packages/balance/stats_and_plots/weighted_stats.py:473: FutureWarning:

The default value of numeric_only in DataFrameGroupBy.sum is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.

/home/scubasteve/projects/venv/lib/python3.9/site-packages/balance/stats_and_plots/weighted_stats.py:473: FutureWarning:

The default value of numeric_only in DataFrameGroupBy.sum is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.

/home/scubasteve/projects/venv/lib/python3.9/site-packages/balance/stats_and_plots/weighted_stats.py:473: FutureWarning:

The default value of numeric_only in DataFrameGroupBy.sum is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.

We can also use different plots, using the seaborn library, for example with the "kde" dist_type.

In [15]:
# This shows how we could use seaborn to plot a kernel density estimation
adjust.covars().plot(library = "seaborn", dist_type = "kde")
Out[15]:
array([<AxesSubplot: title={'center': "distribution plot of covar 'income'"}, xlabel='income', ylabel='Density'>,
       <AxesSubplot: title={'center': "barplot of covar 'gender'"}, xlabel='gender', ylabel='prop'>,
       <AxesSubplot: title={'center': "barplot of covar 'age_group'"}, xlabel='age_group', ylabel='prop'>],
      dtype=object)
In [16]:
adjust.covars().df.index
Out[16]:
RangeIndex(start=0, stop=1000, step=1)

Understanding the weights¶

We can also at the distribution of weights using the following call.

In [17]:
adjust.weights().plot()
/home/scubasteve/projects/venv/lib/python3.9/site-packages/seaborn/distributions.py:306: UserWarning:

Dataset has 0 variance; skipping density estimate.

Out[17]:
[<AxesSubplot: title={'center': "distribution plot of covar 'weight'"}, xlabel='weight', ylabel='Density'>]

And get the design effect using:

In [18]:
adjust.weights().design_effect()
Out[18]:
2.2493789945583806

Outcome analysis¶

In [19]:
print(adjust.outcomes().summary())
WARNING (2022-11-12 23:06:27,137) [balancedf_class/target_response_rates (line 1290)]: Sample does not have target set
1 outcomes: ['happiness']
Mean outcomes:
            happiness
source               
self        54.221388
unadjusted  48.392784

Response rates (relative to number of respondents in sample):
   happiness
n     1000.0
%      100.0



The estimated mean happiness according to our sample is 48 without any adjustment and 54 with adjustment. The following show the distribution of happinnes:

In [20]:
adjust.outcomes().plot()  # dist_type = "kde"
Out[20]:
[<AxesSubplot: title={'center': "distribution plot of covar 'happiness'"}, xlabel='happiness', ylabel='Density'>]

Downloading data¶

Finally, we can prepare the data to be downloaded for future analyses.

In [21]:
adjust.to_download()
Out[21]:
Click here to download: /tmp/tmp_balance_out_0e5ac5bb-d045-4823-aa50-a9e8291ecb3e.csv
In [22]:
# We can prepare the data to be exported as csv - showing the first 500 charaacters for simplicity:
adjust.to_csv()[0:500]
Out[22]:
'id,gender,age_group,income,happiness,weight\n0,Female,25-34,1.0384632065106263,55.9757637459091,10.388001723788236\n1,Male,45+,0.21460348629627557,58.645154141877306,21.221032388150093\n2,Male,35-44,2.3221372597327745,42.28565260361035,9.106922937954346\n3,,18-24,0.08606802599581903,49.21098472260077,5.439376521985937\n4,,35-44,17.156958197550072,49.33084508361814,31.00026920172074\n5,,35-44,4.257130738748466,67.46904416515795,19.23721007589055\n6,,25-34,1.0092927734150772,40.03387164850365,9.246577762'